Binary / row helpers #6096
base: master
Conversation
Currently these are accessible via `AsRef`, but that trait only gives you the bytes with the lifetime of the `Row` struct and not the lifetime of the backing data.
arrow-row/src/lib.rs
Outdated
/// Create a [BinaryArray] from the [Rows] data without reallocating the
/// underlying bytes.
pub fn into_binary(self) -> BinaryArray {
    assert!(
I wonder if we should return `LargeBinaryArray` here, as the offsets are `Vec<usize>` (not `Vec<u32>`)
Open to it! Since array indices are signed ints, we'd need this assert in any case, so I went with what seemed like the most common type. I don't feel especially strongly about it though!
I agree `Binary` is likely the most common type. We could potentially add a `to_large_binary` to support converting to `LargeBinary` 🤔
Thanks @bkirwi and @XiangpengHao
I think this PR needs some additional negative tests and error testing, but otherwise it looks good to me
cc @tustvold in case you have time to comment on the safety of the design
    self.buffer.len() <= i32::MAX as usize,
    "rows buffer too large"
);
let offsets_scalar = ScalarBuffer::from_iter(self.offsets.into_iter().map(i32::usize_as));
I think this needs to check that the offsets don't overflow an `i32`; this should be a `try_into`, I think, and the method should be something like `fn try_into_binary(self) -> Result<BinaryArray>`
My belief is that this is guaranteed by the assert above (which asserts that the len is not larger than `i32::MAX`) and the existing offset invariant (which guarantees that all offsets are valid indices into the binary data), so a more expensive O(n) check seemed redundant.

I'll go ahead and turn that assert into a `Result::Err`; let me know what you think about the other side of it!
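The invariant argument can be sketched in plain Rust (the names here, like `try_offsets_to_i32`, are illustrative stand-ins, not the arrow-row API): a single O(1) length check is enough, because the `Rows` invariant keeps every offset bounded by `buffer.len()`.

```rust
// Hypothetical sketch, not the real arrow-row code: if the whole buffer fits
// in an i32, the offset invariant means every individual offset does too, so
// no per-element overflow check is needed.
fn try_offsets_to_i32(buffer_len: usize, offsets: &[usize]) -> Result<Vec<i32>, String> {
    // Single O(1) check in place of an O(n) scan over the offsets.
    if buffer_len > i32::MAX as usize {
        return Err("rows buffer too large".to_string());
    }
    // Under the invariant `offset <= buffer_len`, this cast cannot truncate.
    Ok(offsets.iter().map(|&o| o as i32).collect())
}

fn main() {
    assert_eq!(try_offsets_to_i32(10, &[0, 4, 10]).unwrap(), vec![0, 4, 10]);
    assert!(try_offsets_to_i32(i32::MAX as usize + 1, &[0]).is_err());
}
```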
I agree with @bkirwi's logic here. If we assume that `self.offsets` is well formed with respect to `self.buffer`, then we shouldn't need to check the individual offsets.
@@ -738,6 +738,23 @@ impl RowConverter {
    }
}

/// Create a new [Rows] instance from the given binary data.
pub fn from_binary(&self, array: BinaryArray) -> Rows {
I wonder if this function needs to be marked `unsafe`; I am worried that someone inserts invalid data into `Rows` here (e.g. modifies the bytes to read in invalid UTF8). However, I see that there is already a way to convert between `Rows` and `[u8]` and then from `[u8]` to `Rows` (e.g. `RowParser::parser`)
Yeah, that's my understanding! This would definitely be unsafe without the `validate_utf8` below... but with it, I believe this has the same safety properties as the existing public API.
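The validate-before-wrap pattern being discussed can be illustrated with a std-only sketch (`parse_validated` is a hypothetical stand-in, not the arrow-row API): validating the bytes at the boundary is what lets the safe wrapper stay sound without marking the constructor `unsafe`.

```rust
// Toy illustration of validating untrusted bytes up front; later code can then
// assume the data is well formed, which keeps the safe API sound.
fn parse_validated(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    // Rejecting invalid input here is the analogue of `validate_utf8` in the PR.
    std::str::from_utf8(bytes)
}

fn main() {
    assert!(parse_validated(b"hello").is_ok());
    assert!(parse_validated(&[0xff, 0xfe]).is_err()); // invalid UTF-8 is rejected
}
```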
for (actual, expected) in back.iter().zip(&arrays) {
    actual.to_data().validate_full().unwrap();
    dictionary_eq(actual, expected)
}
Could you also add some negative tests (e.g. create a BinaryArray from some random bytes, try to convert that back to an array, and make sure it panics)?
Sure! I notice there's just one existing test for the parser, for utf8 data; I've matched that and added a couple more tests for interesting cases.
(This seems like great API surface to fuzz... but it's challenging to write a real fuzzer for, since panics are expected and miri is disabled for our existing fuzzer. May be interesting future work!)
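A negative test of the shape requested might look like the following toy sketch (`parse_row` is a hypothetical stand-in for the real parser): feed malformed bytes in and assert, via `catch_unwind`, that the parser panics rather than silently accepting them.

```rust
use std::panic;

// Toy stand-in for a parser that panics on malformed input.
fn parse_row(bytes: &[u8]) -> u8 {
    assert!(!bytes.is_empty(), "row data must be non-empty");
    bytes[0]
}

fn main() {
    // Valid input parses normally...
    assert_eq!(parse_row(&[7]), 7);
    // ...while malformed input panics, which the negative test catches.
    let result = panic::catch_unwind(|| parse_row(&[]));
    assert!(result.is_err());
}
```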
/// Create a new [Rows] instance from the given binary data.
Can you please add a doc example showing how to do this? I think trying to give a basic example of converting rows, then to/from binary, will not only serve as good documentation, it will make sure all the required APIs are pub (for example, I think `RowParser` needs to be pub)
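The round trip being asked for can be sketched with a toy model (these are illustrative types, not the real `Rows`/`RowConverter` API): a flat byte buffer plus offsets goes out as raw parts via `into_binary` and is rebuilt by `from_binary`, with no per-row copying of the contents.

```rust
// Toy model of the rows <-> binary round trip; not the arrow-row types.
struct ToyRows {
    buffer: Vec<u8>,    // all row bytes, concatenated
    offsets: Vec<usize>, // row i spans offsets[i]..offsets[i + 1]
}

impl ToyRows {
    fn row(&self, i: usize) -> &[u8] {
        &self.buffer[self.offsets[i]..self.offsets[i + 1]]
    }
    // Hand back the raw parts without reallocating the row bytes.
    fn into_binary(self) -> (Vec<u8>, Vec<usize>) {
        (self.buffer, self.offsets)
    }
    // Rebuild rows from previously exported parts.
    fn from_binary(buffer: Vec<u8>, offsets: Vec<usize>) -> ToyRows {
        ToyRows { buffer, offsets }
    }
}

fn main() {
    let rows = ToyRows { buffer: b"abcd".to_vec(), offsets: vec![0, 2, 4] };
    let first = rows.row(0).to_vec();
    let (buffer, offsets) = rows.into_binary();
    let back = ToyRows::from_binary(buffer, offsets);
    assert_eq!(back.row(0), &first[..]); // data survives the round trip
}
```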
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done.
Marking as draft so it is clear this PR isn't waiting on feedback anymore (at least I don't think it is). Please mark it as ready for review when it is ready for another look.
Thanks for the review! I think I've addressed all comments, though there were a couple things I wasn't certain of - addressed inline.
(Looks like there was some merge skew in the tests; I've merged the main branch in here, which ought to fix it.)
I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more.
A few minor comments but these seem like straightforward changes to me.
buffer: array.values().to_vec(),
offsets: array.offsets().iter().map(|&i| i.as_usize()).collect(),
config: RowConfig {
    fields: Arc::clone(&self.fields),
More for my curiosity than anything, but why `Arc::clone(&self.fields)` instead of `self.fields.clone()`?
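For what it's worth, the two spellings are equivalent at runtime; `Arc::clone(&x)` just makes it explicit at the call site that only the reference count is bumped, not the underlying data. A minimal std-only illustration:

```rust
use std::sync::Arc;

// Both forms clone the Arc (pointer + refcount bump), never the Vec inside.
fn clones_share_allocation() -> (bool, usize) {
    let fields = Arc::new(vec!["a", "b"]);
    let by_method = fields.clone();     // method syntax resolves to Arc's Clone
    let by_assoc = Arc::clone(&fields); // identical effect, more explicit
    (Arc::ptr_eq(&by_method, &by_assoc), Arc::strong_count(&fields))
}

fn main() {
    let (same_allocation, count) = clones_share_allocation();
    assert!(same_allocation); // both clones point at the same data
    assert_eq!(count, 3);     // fields + two clones are all alive here
}
```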
///
/// // We can convert rows into binary format and back in batch.
/// let values: Vec<OwnedRow> = rows.iter().map(|r| r.owned()).collect();
/// let binary = rows.try_into_binary().expect("small");
I got a little confused by `.expect("small")`. What does "small" mean in this context? Why not just `.unwrap()`?
(ah, I see rustfmt is failing; CI probably needs to pass before merge)
@bkirwi can you please fix the CI tests so we can merge this PR? Thank you @westonpace for the review.
Thanks for the review! I should be able to get to the follow-up later this week.
/// let parsed: Vec<OwnedRow> =
///     binary.iter().flatten().map(|b| parser.parse(b).owned()).collect();
/// assert_eq!(values, parsed);
/// ```
Maybe it is worth documenting here when this will return an error (i.e. when the data is more than 2GB)?
@@ -738,6 +738,42 @@ impl RowConverter {
    }
}

/// Create a new [Rows] instance from the given binary data.
///
/// ```
I think it may also be worth adding a doc comment here about when this API will panic (when the data passed in is invalid or empty).
Which issue does this PR close?
Closes #6063.
(Potentially - still under discussion at the linked issue!)
Rationale for this change

I've added the optional `from_binary` method discussed in the associated issue also.

What changes are included in this PR?
`data`, `into_binary` and `from_binary` functions, and an extension to the fuzz test that checks the data survives the roundtrip.

Are there any user-facing changes?
Yes, though I suspect the rustdoc covers them enough?